1  Introduction


This final report describes a novel biologically-inspired cognitive architecture that results
from many discussions and collaborative efforts by members of this team, as well as fruitful
discussions with members of other teams participating in the BICA Phase I program. As
part of our Phase I efforts, we contributed ideas to the TOSCA architecture, which was the
basis of our Phase II proposal.

The systems-level architecture described in this final report is inspired by Simmons and
Barsalou (2003), with psychologically-inspired additions proposed by Breazeal et al. (2005),
augmented by key insights from developmental psychology (Smith, 2005). A cognitive
agent will have multiple modality-specific systems, including sensory systems (e.g., vision,
audition, touch), a motor system, an emotional system, and a cognitive system. As a system
perceives its external world and internal mental states, feature systems will represent these
experiences in the relevant perceptual modalities, and a hierarchical system of
neurally-inspired association areas will capture them, so that they can be reenacted or
simulated in the future. These perceptually grounded representations guide the intelligent
and social operations of agents.

Eight  themes  underlie  our  proposed  system,  derived  from  what  is  currently  understood 
about  the  origin  and  developmental  trajectory  of  the  requisite  cognitive  skills  in  humans 
and  brain  science: 

*  The system operates here and now in the physical world.

*  The system is internally motivated to act and learn (e.g., drives and affect), as well as
externally motivated through social interaction with conspecifics.

*  The system also operates in the social world, working with other agents, human and
robotic.

*  The system understands other agents in social terms (e.g., mind-reading, joint
attention, perspective taking, etc.).

*  The system's representations are grounded in modality-specific systems (e.g., vision,
motor, emotion, etc.), in its body, and in its meta-cognition.

*  The system changes as a consequence of its experience, exhibiting both short-term
adaptation to experience and long-term developmental change. Imitation and other
forms of social learning are particularly important.

*  The system's overall architecture is neurally inspired at the systems level, based on
the brain's modality-specific systems and association areas.

*  Although the system is built from a network of distributed computational processes
(e.g., neural nets, agent-based architectures, etc.) and modality-specific representations,
it nevertheless performs classic symbolic operations that appear central to human
intelligence, such as predication and conceptual combination.

Our goal is to break new ground in designing an intelligent system whose cognition
is grounded in the brain's modality-specific systems and shaped by environmental, social,
and developmental constraints. Our approach is deeply informed by lessons from biology
about how to build intelligence. This endeavor not only considers what is in the head as
informed by neuroscience and psychology, but also considers the deep involvement of the
rest of the body, as well as the nature of the environment the intelligence is situated within,
and the activities in which it is engaged.


2  Theoretical  Background 

Embodied  Cognition 

During its first few decades of existence, artificial intelligence has not only drawn from
theories of cognitive psychology, but to a wide extent also shaped notions of amodal, symbolic
information processing. According to this view, information is translated from perceptual
stimuli into nonperceptual symbols, later used for information retrieval, decision making,
and action production. This view also corresponds to much of the work currently done in
robotics, where sensory input is translated into semantic symbols, which are then operated
upon for the production of motor control.

An increasing body of recent findings challenges this view and suggests instead that
concepts and memory are phenomena grounded in modal representations utilizing many
of the same mechanisms used during the perceptual process. A prominent theory
explaining these findings is one of "simulators", siting memory and recall in the very neural
modules that govern perception itself, subsequently used by way of "simulation" or
"imagery" [Barsalou, 1999, Kosslyn, 1995]. Perceptual symbols are organized in cross-modal
networks of activation which are used to dynamically reconstruct and produce knowledge,
concepts, and decision making. This view is supported by evidence of inter-modal
behavioral influences, as well as by the detection of perceptual neural activation when a subject
is using a certain concept in a non-perceptual manner (e.g., [Martin, 2001]).

Thus, when memory or language are invoked to produce behavior, the underlying
perceptual processes elicit many of the same neural patterns and behaviors normally used to
regulate perception [Spivey et al., 2005]. To name but a few examples, it has been shown
that reading a sentence that has an implied orientation reduces response time on image
recognition that is similarly oriented [Stanfield and Zwaan, 2001]; memory recall
impairment is found to match speech impediments in children (mistaking rings for wings in
children that pronounce 'r's as 'w's) [Locke and Kutz, 1975]; and comparing visually similar
variations of a word (e.g., a pony's mane and a horse's mane) is faster than visually distinct
variations (e.g., a lion's mane) [Barsalou et al., 1999].

In parallel to a perception-based theory of cognition lies an understanding that
cognitive processes are equally interwoven with motor activity. Evidence in human
developmental psychology shows that motor and cognitive development are not parallel but
highly interdependent. For example, research showed that artificially enhancing
3-month-old infants' grasping abilities (through the wearing of a sticky mitten) raised some of
their cognitive capabilities to the level of older, already grasping^1, infants [Somerville et al.,
2004].

A related case has been made with regard to hand signals, which are viewed by some as
foremost instrumental to lexical lookup during language generation [Krauss et al., 1996],
and is supported by findings of redundancy in head movements [McClave, 2000] and facial
expression [Chovil, 1992] during speech generation.

A large body of work points to an isomorphic representation between perception and
action, leading to mutual and often involuntary influence between the two [Wilson, 2001].
Some researchers speak of specific 'privileged loops' from perception to action, for example
between speech and auditory perception, or visual perception and certain motor
activity, indicating these action systems' roles in the perceptual pathway [McLeod and Posner,
1984].

In addition, neurological findings indicate that, in some primates, observing an
action and performing it cause activation in the same cerebral areas, a phenomenon labeled
"mirror neurons" [Gallese and Goldman, 1996]. Similarly, listening to speech has been
shown to activate motor areas related to speech production [Wilson et al., 2004], and in
pianists, listening to a tonal sequence triggers neural activation in areas associated with finger
movement, both for the current tone and for the next tone in cases where the subject
is familiar with the melody. This common coding is thought to play a role in imitation,
and the relation of the behavior of others to our own, which is considered a central
process in the development of a Theory of Mind [Meltzoff and Moore, 1997], and possibly
underlies the rapid and effortless adaptation to a partner that is needed to perform a joint
task [Sebanz et al., 2006]. For a thorough review of these mirror neurological phenomena
and the related connection between perception and action, as it pertains to human-robot
interaction, see [Matarić, 2002].

An important insight of the evidence is that perceptual processing is not a strictly
bottom-up analysis of raw available data, as it is often modeled in robotic systems.
Instead, simulations of perceptual processes prime the acquisition of new perceptual data,
motor knowledge is used in sensory parsing, and intentions, goals, and expectations all
play a role in the ability to parse the world into meaningful objects. This seems to be
particularly true for the parsing of human behavior in a goal-oriented and anticipatory manner,
a vital component of joint action.

Experimental data support this hypothesis, finding perception to be predictive (for a
review, see [Wilson and Knoblich, 2005]). In vision, information is sent both upstream
and downstream, and object priming triggers top-down processing, biasing lower-level
mechanisms in sensitivity and criterion [Kosslyn, 1995]. Similarly, visual lip-reading affects
the perception of auditory syllables, indicating that the sound signal is not processed as a
raw unknown piece of data [Massaro and Cohen, 1983]^2. High-level visual processing
is also involved in the perception of human figures from point light displays, enabling
subjects to identify gender and identity from very sparse visual information [Thornton
et al., 1998].

^1 In the physical sense.

While the idea of top-down processing has been utilized in some form in past object
recognition systems, e.g., [Bregler, 1997, Hamdan et al., 1999], and in an amodal setting in
collaborative planning systems, e.g., [Rich et al., 2001], the application of an integrated
top-down implementation in a robotic behavior system is yet to be tested.


We will use the term "embodied cognition" to relate to the effect and interrelation of the
above-mentioned elements: (a) perceptual symbol systems, (b) integration between
perception and action, and (c) top-down processing. We use this overarching term to denote
an approach that views mental processes not as amodal semantic symbol processors with
perceptual inputs and motor outputs, but as integrated psycho-physical systems acting as
indivisible wholes.


3  Multimodal  Representation  via  Simulators 

This section describes the main components of the proposed research architecture, which
stands on three main pillars, based on the theoretical framework outlined above:
(a) perception-based memory; (b) action-perception activation networks; and (c) a motivation-
and goal-based action selection mechanism. These three concepts are highly interrelated: the
perceptual memory model supports top-down simulator- and emulator-type biasing, which is
activated, among others, by the action-perception activation network connections. The
motivation/goal layer both builds on the perception-action layer, by serving as a recourse
for unresolved goals, and conversely triggers simulator- and emulator-type biasing as
part of the action selection mechanism.


Perception-Based Memory

Grounded in perceptual symbol theories of cognition, and in particular that of simulators
[Barsalou, 1999], memories and concepts are based on perceptual snapshots and reside
in the various perceptual systems rather than in an amodal centralized semantic network
(Figure 2). Incoming perceptions are organized in percept trees [Blumberg et al., 2002] for
each modality, organizing them on a gradient of specificity (Figure 1).

Memories are organized in activation networks, connecting them to each other, as well
as to concepts, actions, plans, and intentions (the dotted lines in Figure 2). However, it
is important to note that connections between perceptual, conceptual, and action
memory are not binary. Instead, a single perceptual memory (for example, the sound of a cat
meowing) can be more strongly connected to one memory, like the proprioceptive
memory of tugging at a cat's tail, or more weakly connected to another memory, such as that
of one's childhood living room scent.

^2 As cited in [Wilson, 2001].

Figure 1: Percept trees organize sensory experience on a gradient of specificity.

Perceptions are retained in memory, but decay over time. Instead of subscribing to
the classic division of short- and long-term memory, the proposed architecture advocates
a gradient information-based decay in which more specific perceptual memory decays
faster, and more general perceptual information is retained longer. Decay is governed
by the amount of storage needed to keep perceptual memory at various levels of
specificity, retaining more specific memories for a shorter period of time, and more abstract (or
filtered) perceptions longer.

These retentions are also influenced by the relevance of the perceptual information to
the current task and their attentive saliency; perceptual elements that are specifically
attended to at time of acquisition will be retained longer than ones that were peripheral at
that time. In a similar vein, affective states of the system also influence memory retention^3.

For example, in the visual processing pathway, raw images are retained for a short
period of time, while colors, line orientations, blob location, and fully recognized objects
remain in memory for a longer time span (Figure 3).
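
The gradient decay described above can be sketched in a few lines. This is a minimal
illustration, not the report's implementation: the level names, the storage costs, the
storage budget, and the multiplicative effect of attention and affect are all assumptions
introduced here.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical specificity levels for the visual pathway, most specific first.
# The storage costs are illustrative numbers, not values from the report.
STORAGE_COST = {
    "raw_image": 1000.0,
    "color": 50.0,
    "line_orientation": 20.0,
    "blob_location": 10.0,
    "object": 1.0,
}

@dataclass
class Percept:
    level: str         # key into STORAGE_COST
    attention: float   # 0..1, how strongly attended at acquisition
    affect: float      # 0..1, affective arousal at acquisition
    created_at: float  # acquisition time in seconds

def retention_time(p: Percept, budget: float = 1000.0) -> float:
    """Seconds a percept is kept: cheaper (more abstract) levels last longer,
    and attention and affect stretch the retention window (assumed form)."""
    base = budget / STORAGE_COST[p.level]
    return base * (1.0 + p.attention) * (1.0 + p.affect)

def prune(memory: List[Percept], now: float) -> List[Percept]:
    """Drop every percept whose retention window has elapsed."""
    return [p for p in memory if now - p.created_at < retention_time(p)]
```

Under these assumed numbers, an unattended raw image is dropped after one second, while an
attended raw image acquired in a strongly affective state survives four times longer, and a
recognized object outlasts both.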

Simulators  and  Production 

A key capability of perceptual memory is the production of new perceptual patterns and
concepts. Using simulation, the activation of a perceptual symbol can evoke the
construction of both a past and a fictional situation. Citing [Barsalou, 1999]: "Productivity in
perceptual symbol systems is approximately the symbol formation process run in reverse. During
symbol formation, large amounts of information are filtered out of perceptual
representations to form a schematic representation of a selected aspect. During productivity, small
amounts of the information filtered out are added back."

^3 Such an approach could explain why certain marginal perceptions are retained for a long time if they occurred
in a traumatic setting.

Figure 2: Perception-Based Memory (legend: weighted activation connection; decaying
perceptual memory).

Figure 3: Perceptual memory is filtered up in specificity and remains in memory for a
variable amount of time, based, among others, on the information requirements at each
specificity level.

One mechanism enabling the top-down processing of perceptual memory is the implicit
connection between feature detecting percepts within a certain modality. As illustrated in
Figure 4, an activation of a red detector in a perceptual concept implicitly activates other
percepts which have activated the same node (as these percept nodes model the same
underlying neural structure). Based on the weight of the connection between the activated
feature and the perceptual snapshot, different concepts are activated to a varying extent.

A second activation mechanism operates on a higher level, linking percepts of different
modalities using weighted connections (Figure 5). Weights of connectivity are learned and
refined over time. Their value is influenced by a number of factors, such as frequency of
co-occurrence, attention while creating the perceptual memory, and affective states during
perception.
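
One way such weights could be learned is sketched below. The saturating update rule, the
learning rate, and the way attention and affect scale each step are assumptions made for
illustration; the report does not specify a learning rule.

```python
from collections import defaultdict

class CrossModalLinks:
    """Weighted links between percepts of different modalities (hypothetical class)."""

    def __init__(self, rate: float = 0.1):
        self.rate = rate
        self.weights = defaultdict(float)  # (percept_a, percept_b) -> weight in [0, 1]

    def observe(self, a: str, b: str, attention: float = 0.5, affect: float = 0.0) -> float:
        """Strengthen the a-b link after one co-occurrence.

        attention and affect (both 0..1) scale the size of the step, so
        frequent, attended, affectively charged pairings end up strongest.
        """
        key = tuple(sorted((a, b)))
        step = min(1.0, self.rate * (0.5 + attention) * (1.0 + affect))
        w = self.weights[key]
        self.weights[key] = w + step * (1.0 - w)  # saturating move toward 1
        return self.weights[key]
```

Repeated calls to `observe` for the same pair model frequency of co-occurrence: each
pairing moves the weight a fraction of the remaining distance toward 1, so the weight
grows with repetition but never exceeds 1.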

Figure 4: Simulation activates percepts if an active node was triggered in another percept.

In order to resolve a multitude of (possibly contradictory) perceptual activations using
limited processing capacity, perceptual memories can inhibit one another. A more
dominant activation of a certain perceptual symbol (for example, the sound "nose" activating
imagery of a human nose rather than an airplane's nose) will inhibit a less dominant
activation. However, to enable less dominant activations after a certain time (based, for example,
on the monitoring of higher level goals), a self-inhibiting process can be used to decay the
dominance of active perceptual symbols.

As part of the research architecture proposed here, we have begun implementing a
perception-based associative memory model which supports top-down biasing of
lower-level perceptual models. In the example depicted in Figure 7, an auditory percept (the
sound "Elmo") activates a canonical visual memory of the figure, which, using the same
pathway utilized in visual perception, detects the dominant color of the image. This
color is then used as a bias affecting the low-level visual buffer, shifting it towards
detection of similarly colored areas, eventually priming the system to detect the Elmo puppet
in the visual field more easily. This alternative interpretation of the concept of
anticipatory behavior illustrates the potential perceptual capabilities of a simulator-based memory
model.
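
A toy version of this color-priming loop is sketched below. It is not the implementation
referred to above: the twelve-bin hue histogram, the saturation and brightness thresholds,
the saliency measure, and the gain factor are all assumptions made for the example.

```python
import colorsys

def _hue_bin(r, g, b):
    """Coarse hue bin (0..11) for a saturated pixel, or None for gray/dark pixels."""
    h, s, v = colorsys.rgb_to_hsv(r, g, b)
    if s > 0.3 and v > 0.2:
        return int(h * 12) % 12
    return None

def dominant_hue(pixels):
    """Most common hue bin among the colored pixels of a remembered image.
    pixels: iterable of (r, g, b) tuples with components in 0..1."""
    counts = [0] * 12
    for r, g, b in pixels:
        bin_index = _hue_bin(r, g, b)
        if bin_index is not None:
            counts[bin_index] += 1
    return counts.index(max(counts))

def biased_saliency(pixels, target_bin, gain=2.0):
    """Per-pixel saliency of the visual buffer; pixels whose hue falls in the
    primed bin are boosted by `gain` (the top-down bias)."""
    saliency = []
    for r, g, b in pixels:
        _, s, v = colorsys.rgb_to_hsv(r, g, b)
        boost = gain if _hue_bin(r, g, b) == target_bin else 1.0
        saliency.append(s * v * boost)
    return saliency

# The remembered image is mostly red, so red regions of the scene are primed.
memory_image = [(0.9, 0.1, 0.1)] * 5 + [(1.0, 1.0, 1.0)] * 2
scene = [(0.8, 0.1, 0.1), (0.1, 0.1, 0.8)]
primed = biased_saliency(scene, dominant_hue(memory_image))
```

Here the simulator reuses the same hue analysis for the remembered image and for the
incoming buffer, mirroring the idea that recall and perception share a pathway.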

Figure 6: Competing perceptual symbols are incrementally resolved using dominant
symbol self-inhibitory decay.


Emulators 

Based on behavioral and neurological findings in humans, and on their apparent
importance to joint action, perceptual activation and simulation do not only occur in
retrospect, but also prospectively, in forms of so-called "emulators" [Wilson and Knoblich,
2005]. In the proposed system, emulators are modeled as a parallel perceptual activation
system projecting expectations of perceptual experiences at an atomic time scale.
Emulators can also run on representations of procedural knowledge to engage in plan recognition
when observing a conspecific, or to anticipate the potential outcome of the agent's own actions.


Action-Perception Activation Networks

Perceptions are not only situated in weighted activation networks; they are also
connected to motor actions (see Figure 2). This is in line with the action-perception
integration approach laid out earlier. Inspired among others by architectures such as Contention
Scheduling [Cooper and Shallice, 2000], activities and perceptual concepts occur in a
perpetually updating activation relationship.

To elaborate on the operation of such networks: current perceptions exert a weighted
influence on activities, leading both to a potential action selection, as well as to the activation
of an attention mechanism which in turn refines the perceptual task, and aids its success.
Thus, for example, the presentation of a screwdriver may activate a grasping action, as well
as the activity of applying a screwdriver to a screw. This will activate a perceptual
simulator guiding the visual search towards a screw-like object, and the motor memory related to
a clockwise rotation of the wrist.
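
The screwdriver example can be written as spreading activation over a small weighted
graph. The node names, link weights, and the additive, capped propagation rule below are
illustrative assumptions rather than the network used in the proposed system.

```python
# Hypothetical weighted links; names and weights are illustrative only.
LINKS = {
    "see:screwdriver": {"act:grasp": 0.8, "act:drive_screw": 0.6},
    "act:drive_screw": {"sim:screw_like_object": 0.9, "motor:wrist_clockwise": 0.7},
}

def spread(sources, links, depth=2):
    """Propagate activation along weighted links for `depth` hops.

    sources: dict mapping node -> initial activation.
    Activation arriving at a node over several paths is summed and capped at 1.
    """
    activation = dict(sources)
    frontier = dict(sources)
    for _ in range(depth):
        reached = {}
        for node, a in frontier.items():
            for target, w in links.get(node, {}).items():
                reached[target] = min(1.0, reached.get(target, 0.0) + a * w)
        for node, a in reached.items():
            activation[node] = min(1.0, activation.get(node, 0.0) + a)
        frontier = reached
    return activation

# Seeing a screwdriver activates grasping directly, and the screw-search
# simulator and wrist-rotation motor memory one hop later.
state = spread({"see:screwdriver": 1.0}, LINKS)
```

With these assumed weights, the grasping action receives the strongest activation, while
the visual simulator for a screw-like object and the wrist-rotation motor memory are
pre-activated more weakly through the screw-driving activity.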

Figure 7: Top-down processing and cross-modal activation in a perceptual-based
associative memory model.


Figure 8: Perceptual symbols and motor activity are interconnected with weighted
activations.


In accordance with behavioral evidence [McLeod and Posner, 1984, Wilson, 2001],
certain modalities of perception are more strongly connected to particular motor subsystems,
resulting in so-called privileged loops between perception and action. Setting aside the
discussion of innateness or development of such connections, these "loops" are found in
humans to exist, for example, between auditory perception and speech production, as well as
between the perception of conspecifics' actions and one's own body motor system. This thesis
proposes that these mechanisms play a significant role in practiced activity and are
instrumental to the quick response time necessitated by fluent joint action.

Specifically, it is a hypothesis of this thesis that such privileged loops, as well as other
highly weighted connections within and between perceptual and motor activation patterns,
give rise to some of the nonverbal behavior that is necessary for fluent joint action. The
time-shortest path from a perception to an action might "lead through" a certain set of
perceptual and motor activations, which may double as a coordination device for the swift
success of a joint activity.


Figure  9:  Stronger  connections  between  perception  and  action  result  in  so-called  privileged 
loops. 


Prediction  and  Anticipated  Action 

Among other factors, successful coordinated action has been linked to the formation of
expectations of each partner's actions by the other [Flanagan and Johansson, 2003], and the
subsequent acting on these expectations [Knoblich and Jordan, 2003]. We have presented
initial work aimed at understanding possible mechanisms to achieve this behavior in a
collaborative robot [Hoffman and Breazeal, 2006]. A further research goal is to evaluate
whether more fluid anticipatory action can emerge from an approach that uses
perception-action activation as its underlying principle, and what directive role anticipation can play
in the formation of perceptual processing and fluent joint action.

In the proposed framework, anticipation appears on the perceptual level by the use of
simulators, biasing incoming perceptual data; and emulators, predictively building
perceptual models of temporal sequences. With the introduction of action sequences,
anticipation can also be formed by the cross-activation of perception and action, possibly using
a Bayesian model of action transitions, and preemptively activating perceptual symbols
based on the anticipation of actions in such sequences.
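
As one concrete reading of "a Bayesian model of action transitions", the sketch below
estimates first-order transition probabilities with a uniform Dirichlet prior (add-one
smoothing) and pre-activates the percepts linked to likely next actions. The class,
function, and action names, the prior, and the activation threshold are assumptions
introduced for illustration.

```python
from collections import defaultdict

class ActionTransitionModel:
    """First-order action transition model with a uniform Dirichlet prior."""

    def __init__(self, actions):
        self.actions = list(actions)
        self.counts = defaultdict(lambda: defaultdict(int))

    def observe(self, sequence):
        """Count each consecutive (previous, next) action pair in a sequence."""
        for prev, nxt in zip(sequence, sequence[1:]):
            self.counts[prev][nxt] += 1

    def predict(self, current):
        """Posterior mean of P(next | current) under the add-one prior."""
        row = self.counts[current]
        total = sum(row.values()) + len(self.actions)
        return {a: (row.get(a, 0) + 1) / total for a in self.actions}

def preactivate(model, current, percepts_for_action, threshold=0.4):
    """Percepts to prime now, keyed by percept, valued by the predicted
    probability of the action they belong to."""
    primed = {}
    for action, prob in model.predict(current).items():
        if prob >= threshold:
            for percept in percepts_for_action.get(action, []):
                primed[percept] = prob
    return primed
```

After observing a few repetitions of a task, calling `preactivate` with the action
currently under way returns the perceptual symbols to bias ahead of time, which is the
cross-activation of action and perception described in the text.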

By modeling goals and intentions of the human collaborator, using a simulation-theoretic
framework (as, for example, described in [Gray, 2004]), anticipation can occur on higher
task levels, "trickling down" through action detection mechanisms to the action-perception
activation network.


Self  As  Simulator  for  Mindreading 


Simulation Theory (ST) is one of the dominant hypotheses about the nature of the
cognitive mechanisms that underlie theory of mind [Davies and Stone, 1995, Gordon, 1986, Heal,
2003]. Simulation Theory posits that by simulating another person's actions and the stimuli
they are experiencing using our own behavioral and stimulus processing mechanisms,
humans can make predictions about the behaviors and mental states of others based on the
mental states and behaviors that we would possess in their situation. In short, by thinking
as if we were the other person, we can use our own cognitive, behavioral, and motivational
systems to understand what is going on in the heads of others.

Andrew Meltzoff proposes that the way in which infants learn to simulate others is
through imitative interactions. Meltzoff posits that infants are in fact intrinsically
motivated to imitate their conspecifics, and that the act of successful imitation is its own reward.

For instance, Meltzoff hypothesizes that the human infant's ability to translate the
perception of another's action into the production of their own action provides a basis for
learning about self-other similarities, and for learning the connection between behaviors and the
mental states producing them [Meltzoff, 1996]. To begin with, imitating another's
expression or movement is a literal simulation of their behavior. By physically copying what the
adult is doing, the infant must, in a primitive sense, generate many of the same mental
phenomena the adult is experiencing, such as the motor plans for the movement. Meltzoff
notes that, to the extent that a motor plan can be considered a low-level intention,
imitation provides the opportunity to begin learning connections between perceived behaviors
and the intentions that produce them. Emotional empathy is one of the earliest forms of
social understanding that imitation could facilitate. Experiments have shown that producing
a facial expression generally associated with a particular emotion is sufficient for eliciting
that emotion [Strack et al., 1988]. Hence, simply mimicking the facial expressions of others
could cause the infant to feel what the other is feeling, thereby allowing the infant to learn
how to interpret emotional states of others from facial expressions and body language.

Mirror neurons have been proposed as a possible neurological mechanism
underlying both imitative abilities and Simulation Theory-type prediction of others' behaviors and
mental states [Gallese and Goldman, 1998]. Within area F5 of the monkey's premotor
cortex, these neurons show similar activity both when a primate observes a goal-directed
action of another (such as grasping or manipulating an object), and when it carries out that
same goal-directed action [Rizzolatti et al., 1996, Gallese et al., 1996]. This firing pattern has
led researchers to hypothesize that there exists a common coding between perceived and
generated actions. Meltzoff argues that this structure is represented within an intermodal
space into which infants are able to map all expressions and movements that they perceive,
regardless of their source. In other words, the intermodal space functions as a universal
format for representing gestures and poses: those the infant feels himself doing, and those
he sees the adult carrying out. The universal format is in terms of the movement primitives
within his act space. Thus the perceived expression is translated into the same movement
representation that the infant's motor system uses (recall the discussion of mirror neurons in
section 33), making their comparison much simpler. The imitative link between movement
perception and production is forged in the intermodal space.

Inspired by these theories and findings, our simulation-theoretic approach and
implementation enables the cognitive agent to monitor an adjacent conspecific (i.e., the
human trainer in the simulation) by simulating his or her behavior within the agent's own
generative mechanisms on the motor, goal-directed behavior, and perceptual-belief levels.
Our implementation computationally models simulation-theoretic mechanisms
throughout several systems within the overall cognitive architecture. For instance, within the motor
system, mirror-neuron inspired mechanisms are used to map and represent perceived body
positions of another into the cognitive agent's own joint space to conduct action
recognition. The agent reuses its belief-construction systems from the visual perspective of the
human teacher to predict the beliefs the human is likely to hold to be true given what he or
she can visually observe. Finally, within the goal-directed behavior system, where schemas
relate preconditions and actions with desired outcomes and are organized to represent
hierarchical tasks, motor information is used along with perceptual and other contextual clues
(i.e., task knowledge) to infer the human's goals and how he or she might be trying to
achieve them (i.e., plan recognition).

The  general  methodology  is  summarized  as  follows: 

*  Map  the  human's  actions  onto  the  cognitive  agent's  motor  representations  via  the 
mirror  system  within  the  cognitive  agent's  perception-activation  networks.  Tag  these 
actions  as  coming  from  "other." 

*  Use dual pathways from the motor to the cognitive systems to pass this
information to those systems that evoke these movements. For instance, movements used
to achieve task goals would pass up to the goal-directed behavior system that runs
emulators over procedural representations.

*  Consider the perceptual context from the human's visual perspective. Based on this,
hypothesize what their likely beliefs are about the perceptual context by running
simulators in the perceptual activation networks. Tag these as coming from "other."

*  Consider  any  other  relevant  contextual  information,  such  as  task  knowledge. 

* With this dual-pathway information, use the cognitive systems as an emulator
running on these "other-derived" inputs to infer the likely goals or beliefs that would
arise given these circumstances within the associated systems. (Note that multiple
hypotheses could be generated and weighted probabilistically to indicate how likely
they are.)

*  Use  this  information  to  predict  the  human's  behavior,  and  to  shape  the  agent's  own 
responses. 

* Incorrect inferences present an opportunity for learning. Ideally, even incorrect
inferences should at some level seem plausible to the human. This will assist with efficient
error recovery and reduce the chance that the same mistake is made in the future.
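
The steps above can be illustrated with a minimal Python sketch. It is not the implementation described in this report: the names `Schema` and `infer_goals`, the example schemas, and the mismatch probability `miss` are all hypothetical, and a simple normalized likelihood score stands in for the emulators. The inputs are assumed to have already been mapped into the agent's own representations and tagged as coming from "other."

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Schema:
    """Hypothetical goal schema: preconditions, an action sequence, and a goal."""
    goal: str
    preconditions: frozenset
    actions: tuple


def infer_goals(observed_actions, other_beliefs, schemas, miss=0.1):
    """Weight each schema's goal by how well it explains the "other"-tagged input.

    observed_actions: actions of the human, already mapped to the agent's own
                      motor representations (assumption of this sketch).
    other_beliefs:    facts the human is hypothesized to believe, given what
                      he or she can see.
    miss:             assumed probability of observing an unexpected action.
    """
    scores = {}
    for s in schemas:
        # A schema is only plausible if the human could believe its preconditions.
        if not s.preconditions <= other_beliefs:
            continue
        score = 1.0
        for i, act in enumerate(observed_actions):
            expected = s.actions[i] if i < len(s.actions) else None
            score *= (1.0 - miss) if act == expected else miss
        scores[s.goal] = scores.get(s.goal, 0.0) + score
    total = sum(scores.values())
    # Normalize so the hypotheses form a probability distribution over goals.
    return {g: v / total for g, v in scores.items()} if total else {}


# Illustrative task knowledge (invented for this example).
schemas = [
    Schema("open_box", frozenset({"box_visible"}), ("reach", "grasp_lid", "lift")),
    Schema("push_button", frozenset({"button_visible"}), ("reach", "press")),
]
beliefs = frozenset({"box_visible", "button_visible"})
hypotheses = infer_goals(("reach", "grasp_lid"), beliefs, schemas)
```

In this sketch both goals remain candidate hypotheses, but "open_box" receives most of the probability mass because it explains both observed actions. If the button were occluded from the human's viewpoint, "push_button" would be dropped from the hypothesis set altogether.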

Intentions, Motivations, and Supervision


It should be noted that much of the above architecture is best suited to governing
automatic or routine activity. Perceptions precipitate concepts and object memories,
which in turn prompt actions governing future perception and motor activation. While
this model can be adequate for well-established behavior, an agent acting jointly with a
human must also behave in concordance with internal motivations and intentions, and
higher-level supervision. This is particularly crucial because humans naturally assign
internal states and intentions to animate and even inanimate objects [Baldwin and Baird,
2001, Dennett, 1987, Malle et al., 2001]. Our cognitive agent acting with people must behave
according to internal drives as well as clearly communicate these drives to its human
counterpart.

In the proposed architecture, supervisory intention- and motivation-based or affect-based
control affects the autonomic processing scheme outlined above (Figure 10). At the
base of this supervisory system lie core drives, such as hunger, boredom, attention-seeking,
and other domain-specific drives (such as task success), modeled as scalar fluents, which
the agent seeks to maintain at an optimal level (similar to [Breazeal, 2002]). Emotional
factors can also contribute at this level. If any of these fluents rises above or falls below its
defined range, it triggers a goal request. The number and identity of the agent's motivational
drives are fixed, but the emotional system adds flexibility and adaptability. The Goal
Arbitration module decides what goal(s) to pursue based on situational context and internal
state (e.g., current goals, motivations and emotions).
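
As a rough illustration of drives modeled as scalar fluents, the following Python sketch emits a goal request whenever a drive leaves its range and then arbitrates among the requests. The class `Drive`, the function `arbitrate`, the goal names, the urgency measure (distance outside the range), and the `context_weight` table are assumptions made for this example, not details given in this report.

```python
from dataclasses import dataclass


@dataclass
class Drive:
    """Hypothetical drive: a scalar fluent with a desired [low, high] range."""
    name: str
    level: float
    low: float
    high: float

    def goal_request(self):
        """Return (goal, urgency) if the level is out of range, else None."""
        if self.level < self.low:
            return (f"raise_{self.name}", self.low - self.level)
        if self.level > self.high:
            return (f"lower_{self.name}", self.level - self.high)
        return None


def arbitrate(drives, context_weight=None):
    """Pick the most urgent goal request, optionally scaled by context.

    context_weight maps goal names to multipliers and stands in for the
    situational context and emotional state (an assumption of this sketch).
    """
    context_weight = context_weight or {}
    requests = [r for r in (d.goal_request() for d in drives) if r is not None]
    if not requests:
        return None
    goal, _ = max(requests, key=lambda r: r[1] * context_weight.get(r[0], 1.0))
    return goal


drives = [
    Drive("boredom", level=0.9, low=0.2, high=0.7),       # too high
    Drive("task_success", level=0.1, low=0.4, high=1.0),  # too low
    Drive("attention", level=0.5, low=0.3, high=0.8),     # within range
]
selected = arbitrate(drives)
swayed = arbitrate(drives, context_weight={"lower_boredom": 2.0})
```

Here `selected` is the task-success goal because that drive is furthest outside its range, while the context weighting in the second call tips the choice toward reducing boredom. The set of drives stays fixed, and only the weighting changes, which mirrors the division between fixed drives and a flexible emotional system described above.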

Once a goal is selected, a deliberative process follows to construct a plan that satisfies
the goal. A plan can then activate or override the automatic action selection mechanisms
in the action-perception activation network, activating simulators and guiding perception
and motor activity, according to the principles laid out in [Cooper and Shallice, 2000]. The
outcome of actions is fed back through perceptual acquisition to the motivational system.
Note that this model can also be used as a basis for an experience-based intention-reading
framework, as outlined in the previous section.
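
One simple way to picture a plan swaying or overriding automatic action selection is as a top-down bias added to bottom-up activations. The Python sketch below is only an illustration under that assumption: the function `select_action`, its parameters, and the activation values are hypothetical and are not taken from this report or from [Cooper and Shallice, 2000].

```python
def select_action(automatic, plan_bias=None, override=None):
    """Choose an action from bottom-up and top-down activation.

    automatic: dict mapping actions to activation from the perception-driven
               network (routine behavior).
    plan_bias: dict of additive activation contributed by the current plan,
               which sways the competition without deciding it outright.
    override:  an action forced by the plan, bypassing the competition.
    """
    if override is not None:
        return override
    plan_bias = plan_bias or {}
    # Sorting the candidates makes tie-breaking deterministic.
    candidates = sorted(set(automatic) | set(plan_bias))
    if not candidates:
        return None
    combined = {a: automatic.get(a, 0.0) + plan_bias.get(a, 0.0) for a in candidates}
    return max(candidates, key=combined.get)


automatic = {"wave": 0.6, "point": 0.4}
routine = select_action(automatic)                            # habit wins
swayed = select_action(automatic, plan_bias={"point": 0.5})   # plan tips the balance
forced = select_action(automatic, override="grasp")           # plan overrides
```

Without a plan the most active routine action is chosen. A modest bias is enough to change the winner, and an override selects an action that the automatic network was not proposing at all.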


Practice 

According to the model put forth here, practice operates on two levels. First, it alters the
weights of connections within the perception and motor activation networks and thus
transfers deliberate action selection procedures (stemming from the motivational layer) to a
faster action activation mechanism. Second, it helps shape the anticipatory action selection
mechanism by reinforcing transition probabilities and time estimates, which in turn are
used by emulators to predict future perceptual and motor operations.
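
The second of these levels can be sketched in Python as follows. This is a minimal sketch that covers only transition probabilities and time estimates, not the reweighting of network connections. The class `PracticeEmulator`, its methods, and the example action names and timings are hypothetical. Transition probabilities are estimated from counts, and time estimates are kept as running means.

```python
from collections import defaultdict


class PracticeEmulator:
    """Hypothetical emulator that learns action transitions from practice."""

    def __init__(self):
        # counts[a][b] is how often action b followed action a.
        self.counts = defaultdict(lambda: defaultdict(int))
        # mean_dt[(a, b)] is the running mean of the delay between a and b.
        self.mean_dt = {}

    def practice(self, sequence):
        """Reinforce transitions from one practiced sequence of (action, time) pairs."""
        for (a, ta), (b, tb) in zip(sequence, sequence[1:]):
            self.counts[a][b] += 1
            n = self.counts[a][b]
            prev = self.mean_dt.get((a, b), 0.0)
            self.mean_dt[(a, b)] = prev + ((tb - ta) - prev) / n

    def predict(self, action):
        """Return (next_action, probability, expected_delay), or None if unseen."""
        followers = self.counts.get(action)
        if not followers:
            return None
        total = sum(followers.values())
        best = max(followers, key=followers.get)
        return best, followers[best] / total, self.mean_dt[(action, best)]


emulator = PracticeEmulator()
emulator.practice([("clap", 0.0), ("left", 1.0), ("clap", 2.0), ("right", 3.0)])
emulator.practice([("clap", 0.0), ("left", 0.8), ("clap", 1.6), ("left", 2.4)])
next_action, probability, delay = emulator.predict("clap")
```

After these two practice rounds the emulator expects "left" to follow "clap" in three of four cases, with an expected delay that averages the observed intervals. Further practice sharpens both estimates, which is what would allow an agent to begin its response before the partner's action has finished.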

Figure 10: Motivational drives are used to form plans overriding or swaying action
selection in the Action-Perception Activation Network.

4  Conclusion


As part of our Phase I effort, we began to computationally implement the novel
biologically inspired cognitive architecture described in this document. It is the first attempt to
computationally model these Barsalou-like simulators and test them within behaving
cognitive agents, namely robots.

We have tested an early version of this architecture on two physical robot systems:
Leonardo (a 65-degree-of-freedom humanoid robot) and AUR (a 6-degree-of-freedom
robotic desk lamp). Both have been used to explore the dynamics of this architecture as
it applies to improving the fluency of joint action through practice. In the case of Leonardo,
a game of patty-cake was used to test the robot's use of emulators to predict and anticipate
a sequence of actions practiced with a human. In the case of AUR, the same architecture
was used to have the robotic lamp anticipate what part of a dimly lit workspace to
illuminate based on the human's needs during a collaborative task. The final results of this effort
will be published as part of Guy Hoffman's PhD thesis, Fluency and Embodiment for Robots
Acting with Humans, to be completed in August 2007.